NASA Contractor Report 195292 




High-Performance Parallel Analysis 
of Coupled Problems for 
Aircraft Propulsion 


C.A. Felippa, C. Farhat, S. Lanteri, U. Gumaste, and M. Ronaghi 
University of Colorado 
Boulder, Colorado 


March 1994 


Prepared for 

Lewis Research Center 

Under Grant NAG3-1273 



National Aeronautics and 
Space Administration 


Bi-Annual Report to 

NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 

NASA Lewis Research Center 

Grant NAG3-1273 


HIGH-PERFORMANCE PARALLEL ANALYSIS OF 
COUPLED PROBLEMS FOR AIRCRAFT PROPULSION 


by 

C. A. Felippa, C. Farhat, S. Lanteri, 

U. Gumaste and M. Ronaghi 

Department of Aerospace Engineering Sciences and 
Center for Space Structures and Controls 
University of Colorado 
Boulder , Colorado 80309-0429 


June 1993 


Report No. CU-CSSC-93-16 


Page intentionally left blank 



Summary 


This research program deals with the application of high-performance computing methods 
for the analysis of complete jet engines. We have initited this program by applying the 
two-dimensional parallel aeroelastic codes to the interior gas flow problem of a by-pass jet 
engine. The fluid mesh generation, domain decomposition and solution capabilities were 
successfully tested. 


1. Introduction 


The present program deals with the application of high-performance parallel compu- 
tation for the analysis of complete jet engines, considering the interaction of fluid, thermal 
and mechanical components. The research is driven by the simulation of advanced aircraft 
propulsion systems, which is a problem of primary interest to NASA Lewis. 

The coupled problem involves interaction of structures with gas dynamics, heat con- 
duction and heat transfer in aircraft engines. The methodology issues to be addressed 
include: consistent discrete formulation of coupled problems with emphasis on coupling 
phenomena; effect of partitioning strategies, augmentation and temporal solution proce- 
dures; sensitivity of response to problem parameters; and methods for interfacing multi- 
scale discretizations in different single fields. The computer implementation issues to be 
addressed include: parallel treatment of coupled systems; domain decomposition and mesh 
partitioning strategies; data representation in object-oriented form and mapping to hard- 
ware driven representation, and tradeoff studies between partitioning schemes and fully 
coupled treatment. 


2. Graduate Students 

Two Ph. D. graduate students begin work this summer under support from the grant. 

M. Ronaghi (U.S. citizen) began his doctoral studies at Colorado on January 1993. 
Mr. Ronaghi has a M.S. in Mechanical Engineering at North Carolina A&T State Uni- 
versity at Greensbroro and has worked at NASA Langley doing finite element structural 
analysis. He has a good understanding of structures and composites and some computer 
experience but lacks background in fluid mechanics, thermomechanics and propulsion. He 
will remedy that by initiating a fluid course sequence this Spring semester and will start a 
thermal-propulsion sequence in the Fall semester. 

U. Gumaste (permanent U.S. resident) begins his graduate studies at Colorado in 
the Fall semester, but will work on this project diming June-July 1993 as an hourly R.A. 
Mr. Gumaste has a B.Tech in Civil Engineering from the Indian Institute of Technology, 
Bombay, India. 

Both students were significantly helped by a visiting Post-Doc, Stephane Lanteri, 
during their first modeling assignment. Dr. Lanteri is affiliated with INRIA Antipolis. 
He is working with Charbel Farhat in the development of parallel finite- volume/element 
methods for 2D and 3D flow around aircrafts, and the analysis of nonlinear fluid-structure 
interaction for flutter and stall analysis. 


3. Flow Analysis of a By-Pass Engine 

The main first-year objective is to “turn inside out” our exterior-domain aeroelastic 
codes to fit the interior-flow engine problem. The codes are then run to assess their strength 
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and weaknesses in numerical analysis and capturing physical effects. Observed weaknesses 
are then addressed by a combination of methodology and modeling improvements. 

The gas flow within an engine is very complex. It exhibits localization, vortices, sharp 
pressure gradients and thermal- combustion effects. Our approach is to incorporate gas 
flow and structural modeling common to the exterior problem, and then solicit the help of 
experts to deal with new effects such as compression, diffussion, mixing and combustion. 

To initiate this program we chose a rather old Conway by-pass engine sketched in 
the textbook of Hesse and Mumford [1], Figure 1 is a schematic diagram of the engine 
presented in Hesse-Mumford’s Fig. 11.7. 

The purpose of the first experiments were to verify if the aeroelastic codes could be 
gracefully adapted to confined gas flow. To play it safe we began with a two-dimensional 
model and used the engine structure essentially as a way to provide boundary conditions 
for the gas flow. Blades and combustion effects were ignored. 

The rather complex boundary configuration provided a good test for the fluid mesh 
generator, which “triangulates” the gas domain. This generation was done by S. Lanteri, 
who is an expert in this subject. 

The fluid meshes were treated with Farhat’s domain decomposer program DOMDEC 
[2]. Meshes were partitioned into 8 domains. Figures 3, 4 and 5 show the decompositions 
produced by the Greedy, Reclusive Graph Bisection (RGB) and Reverse Cuthill-McKee 
(RCM) algorithms, respectievly. Ideally each partition should be single-connected to min- 
imize interface communications overhead in parallel machines. Given the complex config- 
uration of the gas domain, satisfaction of this criterion is by no means obvious. It can be 
seen that RGB met the single-connectivity criterion, but that Greedy and RCM did not. 

The theoretical and computational basis of the gas flow calculations are described in 
the Appendix reprint of an article by Farhat, Lanteri and Fezoui [3]. Computations based 
on Stokes flow were carried out without difficulties. Figures 6 and 7 shoe contour plots of 
pressures and density, respectively, for the steady state corresponding to a free-flow Mach 
number of 0.4. Figure 8 shows the velocity field. 

4. Future Work 

The key need is to introduce more physical effects in the gas flow, namely compression, 
diffusion and combustion. We need to decide whether to continue with a two-dimensional 
axisymmetric model with artifices to represent nonaxisymmetric devices, or to proceed to 
a “sector” three-dimensional model requiring tetrahedral meshes. We plan to consult with 
NASA Lewis experts as to the best way to proceed at this point. Dr. Russ Claus of NASA 
LeRC has offered to provide us a three-dimensional model of a more recent engine. Such 
a model could be used as Testbed for the next phases of this research program. 
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Figure 2. 2D Finite Element idealization of by-pass engine 
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Figure 3 Fluid mesh decomposition by Greedy algorithm implemented in DOMDEC 
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Figure 4 Fluid mesh decomposition by Recursive Graph Bisection (RGB) implemented in 
DOMDEC 
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Figure 5 Fluid mesh decomposition by Recursive Cuthill McKee (RCM) implemented in 
DOM DEC 
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Figure 6 Computed steady-state pressure field 
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Figure 7 Computed steady-state density field 
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Appendix 

Theoretical Background on Viscous Flow Computations 


Summary 

The following material, extracted from a recently published paper by Farhat, Fezoui and Lanteri 
[3], summarizes the theoretical foundations of our parallel Navier-Stokes computations on unstruc- 
tured meshes. Although the article focuses on CM-2 computations carried out during 1990-1991, 
it also presents implementation considerations applicable to the present project. 


1. Introduction 

Previously we have reported on our experience with performing two-dimensional structured com- 
pressible flow computations on the Connection Machine CM-2 (Saati, Biringen and Farhat [Al], 
Lanteri, Farhat and Fezoui [A2]). We have found that this massively parallel processor is par- 
ticularly well suited for explicit computations on regular grids. For grids that result in a high 
virtual processor ratio (VPR or VP ratio), using the NEWS fast communication mechanism, we 
have measured the communication component of the simulation time to represent typically less 
than 10% of the total CPU time. We have concluded that on a 64K machine (65536 processors), 
efficiency rates in the neighborhood of 2 gigaflops are attainable. We have also found that for 
both inviscid (Euler equations) and viscous (Navier-Stokes equations) flow structured computa- 
tions, a 16K CM-2 (16384 processors) can be 4 and 6 times faster than one CRAY-2 processor, 
respectively. 

We focus here on massively parallel viscous flow computations using fully unstructured grids. 
In Section 2, we formulate the problem to be solved, and in Section 3, we derive first-order and 
second-order spatial schemes that are characterized by an upwind integration of the convective 
fluxes. Second-order accuracy is achieved through a Monotonic Upwind Scheme for Conservation 
Laws (MUSCL) technique. An explicit, and therefore nicely parallelizable, Runge-Kutta method 
is selected for time integration; it is summarized in Section 4. Because the mesh irregularities 
inhibit the use of the NEWS mechanism, interprocessor communication is bound to be carried 
out via the slower machine router. If a trivial processor mapping is used, up to 60% of the total 
CPU time is consumed in communication requirements. This bottleneck has been previously 
analyzed and documented by Farhat, Sobh and Park [A3] for massively parallel finite element 
computations in solid mechanics problems. It has also been recently addressed by several other 
investigators for fluid flow computations. In particular, Shapiro [A4] has proposed the use of a 
graph coloring algorithm to allow a particular implementation of the communication steps which 
reduces the communication costs by a factor of two. Hammond and Barth [A5] have developed 
a vertex-based partitioning scheme for inviscid flow computations which attempts to minimize 
both the computational and communication costs associated with unstructured grids. Here, we 
present a strategy for mapping thousands of processors onto an unstructured grid which leads to 
an efficient scheme for carrying out communications of an arbitrary pattern. The key elements 
of this strategy are discussed in Section 5. These include the selection of an appropriate parallel 
data structure, the partitioning of a given unstructured grid into subgrids, and the mapping of 
each individual processor onto an entity of these subgrids. Combining this mapping strategy with 
a communication compiler reduces the communication overhead by an order of magnitude and 
brings it down to 15% of the total simulation time. In Section 6, we apply our massively parallel 
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code and its highly vectorized variant to the simulation of low Reynolds number chaotic flows. 
Measured performance results indicate that for such computations on unstructured grids, an 8K 
CM-2 with single precision floating point hardware is as fast as one CRAY-2 processor. 

2. Mathematical modeling 

First we recall the mathematical problem to be solved, and introduce the notation that is used in 
the sequel. 

2.1. Governing equations 

Let ft C ft 2 be the flow domain of interest and T be its boundary. The conservative law form of 
the equations describing two-dimensional Navier-Stokes flows is given by : 

^ + VJ(W)=±;V.li(W) (1) 


where 



( 2 ) 


The functions F(W) and G(W), and R(W) and S(W), denote the convective and diffusive fluxes, 
respectively. They can be written as : 


F(W) = 


G(W) = 


{ pU \ 

pu 2 + p 

puv 

\u(E + p)J 

( P v \ 

puv I 
pv 2 +p I 
\v(E + p)J 


R(W ) = 


S(W) = 


{ 


\UT, 


0 

T X x 
T X y 

XX 4 " VT X y 

0 

T X y 


*yy 


\ 


+ 


2k_d£. I 

Pr dx ' 


\ 


. UT- 


xy 


+ VT yy + 


( 3 ) 


A-2 


where p is the density, U = ( u,v ) is the velocity vector, E is the total energy per unit of volume, 

p is the pressure, and e is the specific internal energy. The variables p, E, p, U , e, and the 
temperature T are related by the state equation for a perfect gas: 

p = ^-l){E-\p\\U\\ 2 ) (4) 


and by: 



( 5 ) 


where 7 denotes the ratio of specific heats. 


The components of the Cauchy stress tensor r xx , r xy and T yy are given by: 







Txy — P 


fdu dv \ 
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( 6 ) 


where p and k are the normalized viscosity and thermal conductivity coefficients. Two character- 
istic numbers appear in the above equations; the Reynolds number Re = 1° 4 . ° . where po, Uo, 

P 0 

Lo and po denote respectively, the characteristic density, velocity, length and diffusivity of the 

flow under consideration, and the Prandtl number Pr = . 

ko 


We consider the initial and boundary value problem (IBVP): 

' v.^(w) = -LvM(w) 

< w =w 0 (X) x en (7) 

w = w r (it) x er = an 


where Wo and Wr are specified functions, and focus on finding a weak solution of (7) that is 
amenable to massively parallel computations. 

2.2. Boundary conditions 

We are mostly interested in external flows around airfoils. Therefore, we consider the case where 
the computational domain ft is delimited by the boundary T = Tf, U Too- We denote by ~v the 
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outward unit normal at a given point of T (Fig. Al). 



Fig. Al. The computational domain 


In the far field, we assume that the viscous effects are negligible so that the flow is uniform. 
We adopt a formulation where the physical variables are non-dimensionalized. The free-stream 
vector Woo is given by: 


Poo 1 



cos a 
sin a 
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where a is the angle of attack and Moo is the free-stream Mach number. On the wall boundary 
Ti, we impose the no-slip condition and specify the temperature: 

~U = T T = T b (9) 


We do not impose any boundary condition on the density. Therefore, the total energy per unit of 
volume and the pressure on the wall are given by : 

E = pC v Tb p = ( 1 -l)E (10) 
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3. Spatial discretization 


3.1. Preliminary 

The flow domain £1 is assumed to be a polygonal bounded region of 3? 2 . Let 75, be a standard 
triangulation of £1, and h the maximal length of the edges of 75, . A vertex of a triangle A is denoted 
by Si , and the set of its neighboring vertices by K{i). At each vertex 5,-, a cell C, is constructed 
as the union of the subtriangles resulting from the subdivision by means of the medians of each 
triangle of 75, that is connected to Si (Fig. A2). The boundary of Ci is denoted by dCi, and the 
unit vector of the outward normal to dCi by 1 7,- = (t/, x , z/, y ). The union of all of the constructed 
cells forms a non-overlapping partition of the domain £1: 


= ( 11 ) 

1=1 



Fig. A2. Cell definition in an unstructured grid 
For each cell Ci, a characteristic function is defined as : 


*.-(*) 


1 if X 6 Ci 

0 otherwise 


Also, the following discrete spaces are introduced: 

V;, = {vh | Vh E C (£1), Vh |a £ Pi, VA £ 75,} 

Wj, = {vk | Vh E L 2 (fi), vh. |c, = Vi = constant, i = l,...,ns} 


( 12 ) 


(13) 
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where P\ is the space of polynomials in two variables and of degree 1. Clearly, any function / 
belonging to Vh is uniquely determined by its values /(5,) at each vertex Si, and can be expressed 
as: 


/(?)= Y, 

t=l y ns 


(14) 


where {lV,}}z” 4 is a basis of Vh. Finally, it is noted that a natural bijection between the spaces 
Vh and Wh can be constructed as: 


V/ G V h , S(f(X))= Y, (15) 

«=l,n« 


3.2 Variational formulation and first order spatial approximations 
A variational formulation of the IBVP (7) goes as follows: 

Find Wh e (VO 4 , V<^ e v h 
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(16) 


We construct a mixed finite volume/finite element (Galerkin) approximation for solving the above 
problem by introducing appropriate schemes for computing the left and right-hand-side integrals of 
(16). Chosing <fih as the shape function Ni associated with the node Si and applying the operator 
5 to the left hand side of (16) leads to a mass-lumped variational approach which transforms the 
above equation into: 


/ 

Cz 


^±dxdy + J V.J{W h )dxdy 

Ci 

= lCe J ^ -KiW^Nidxdy 

SupNi 


(17) 


where SupNi = [J A. Using Green’s formula for the convective term and integrating by part 
a, s,- 6 A 

the diffusive one leads to: 
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(18) 
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where iV/ 1 is the restriction of Ni to triangle A. Finally, we drop the right hand side boundary 
integral as we enforce the viscous boundary conditions in a strong form on Tj and neglect the 
viscous effects on Too, so that equation (18) simplifies to: 


f^dxdy+Y, I 1 > 

J f(W h ).l?H 


idu < 2 > 


dC { nr 6 


J 


+ / F(W h ).1?id<T < 3 > 


ac,nr 0 


— E?./ 


K(W h ).VN?dxdy < 4 > 


A,5, €A ' 


(19) 


where IF/, is the specified value of Wh at the boundaries. 

The reader should note that the above formulation leads to a locally one-dimensional com- 
putation of each convective term, along the normal direction ~v . For this purpose, the boundary 
dCi of the cell Ci is split into bi-segments dCi, which join the middle point of the edge [5, -5)] 
to the centroids of the triangles having both of Si and Sj as vertices (Fig. A3), and the integral 
< 1 > is evaluated as: 


f y{w h ).-?ijd(T = ?&)• f ~tn da ( 2 °) 

1€K(0 J.. 


where jF ({/) is some approximation of the convective flux computed at the interface between cells 
Ci and Cj . 

Following Fezoui and Stoufflet [A6], we choose P(U) to be a numerical flux function $ 
associated with a first-order accurate upwind scheme (Van Leer [A7]). It is denoted here by H\2 , 
where the superscript ^ emphasizes the first order accuracy, and can be written as: 

= (21) 
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where W{ = Wk(S;) and Wj = Wh(Sj). For example, the following numerical flux functions can 
be used to construct ffjp: 


Roe’s Scheme [A8] 


** (u,v, t>) = 


F(U,T?) + F(V,1?) 


— d(U,V, ~u) 


( 22 ) 


where d (pj, V, u ^ is a numerical diffusivity defined as: 


d (V,v, V) =1 A-(w, I (V - 2 - u) 

and W is some mean value of U et V. 


(23) 



Fig. A3. Splitting of dCij 
• Steger and Warming’s scheme [A9] 

(u, V, Jf'j = A+ (u, ~v ) U + A~ (v, V (24) 

where A = A + + A~ and | A |= A + — A~ . 

The viscous integral < 4 > is evaluated via a classical Galerkin finite element FI method 
which results in a centered scheme. Since the approximations of the physical variables are taken 
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in Vj,, the components of the stress tensor and those of VIV,^ are constant in each triangle. The 
velocity vector in a triangle is computed as: 


3 k 

u* = \ E ^ ( 25 ) 

k=l,kcA 

Consequently, the viscous fluxes are evaluated as: 

Y f n(W h ).VN?dxdy= Y ^ea(A) (r a ^~ + (26) 

A,S,-6A" A,S,-6A ' ' 

where JSa and 5 a are the constant values of R(W) and 5(W) in the triangle A. 

3.3. Higher order extension 

The numerical integration with an upwind scheme described above leads to a spatial approximation 
that is only first-order accurate. Here, we focus on constructing a second-order accurate solution 
without changing the space of approximations. We develop a second-order scheme that is an 
extension of Van Leer’s MUSCL method [A7] to the case of unstructured meshes. 

Usually, a second-order approximation requires the evaluation of the gradient of the solution 
at each vertex. Clearly, the gradient of a function vn of is constant in each element and 
discontinuous in the flow domain. Following the MUSCL method, one way to achieve second- 
order spatial accuracy is to evaluate the fluxes with extrapolated values Wij , Wji at the interface 
dCi C\dCj. Basically, this leads to substituting in the previous scheme by H\^ which is given 
by: 


HIP 

Wij = Wi + L( V W)f .5~S* (27) 

Wji = Wj - \(vwf r s~s 3 


where the approximate nodal gradients ( V W)f ■ are obtained via a ^-combination of centered 
and fully upwind gradients : 

* (VW)f = (1 - 0)(VW)f' tnt + ^vf)^" (28) 


Here, a centered gradient (VW)f en< = (VIF)^ 0 can be chosen as any vector satisfying: 


(VW)f ent .SiSj = W 3 - Wi 


(29) 


A nicely parallelizable scheme for computing the upwind gradients (VW)^ pw goes as follows. 
First, we note that (VW^)f p,u = (VW')^ =1 , and from (28) we derive: 
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(30) 


„ „ . 

(VW)f ptu = 2(VW)f“ 2 -(V^ 


\Cent 


We compute the half-upwind gradients (/3 = y) via a linear interpolation of the Galerkin gradients 
computed in each triangle of C,, so that: 




= l y 

area(Ci) 

A eCi 


area(T ) 
3 


3 

y w k v£ 

k=l,keT 


(31) 


Finally, we evaluate the nodal gradients using the following third-order biased scheme: 
(VW)f = 3 = |(VW)f =0 + |(VW)f =1 

= § (VW)f =0 + | ^2(VW)f = 2 - (VW)f=°) (32) 

= i(VW)? = ° + |(VW)f = ^ 

3.4 ■ Boundary conditions 

The second term < 2 > and the third term < 3 > of the right-hand side of (19) contain the 
physical boundary conditions. These are represented by the vector W h. which involves quantities 
that depend on the interior values of Wh, and quantities that are determined by the physical 
boundary conditions. 

Wall boundary : the no-slip condition is enforced in a strong form (9, 10) so that the corre- 
sponding boundary integral < 2 > does not need to be evaluated. 

Inflow and outflow boundaries : at these boundaries, a precise set of compatible exterior data 
which depend on the flow regime and the velocity direction must be specified. For that purpose, 
a plus-minus flux splitting is applied between exterior data and interior values. More precisely, 
the boundary integral < 3 > is evaluated using a non-reflective version of the flux-splitting of 
Steger and Warming [A9] : 

f(W h ).l?ido = A + ioo).Wi + A~(Wi,l/i 00 ).W 00 (33) 

or 00 
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4. Time discretization 


The resulting semi-discrete fluid flow equations can be written as: 

^-+^W) = 0 (34) 

Because it lends itself to massive parallelism, the explicit Runge-Kutta method is selected for 
integrating the above equations. A 3-step variant is used here. It is summarized as : 

{ w (0) = W n 

W< k) = W <0) - 1} ) k = 1,2,3 (35) 

W rn+1 = W (3) 


The above scheme is often referred to as the low-storage Runge-Kutta method as only the solution 
at substep a — 1 is used to compute the one at substep a. It is third-order accurate in the linear 
case, but only second-order accurate in our case. 


5. Parallel implementation on the Connection Machine CM-2 

Clearly, expressions (19) and (27-35) reveal that both the spatial and temporal integrations are in 
principle nicely parallelizable. In this section, our interest lies in investigating the most efficient 
way to implement these computations on a Single Instruction Multiple Data (SIMD) massively 
parallel computer such as the Connection Machine CM-2. Special care is given to interprocessor 
communication because mesh irregularities: (a) inhibit the exploitation of the NEWS grid, so 
that the relatively slow router must be used, and (b) induce a different amount of communication 
steps within each processor, which is not particularly desirable on a SIMD machine. Rather 
than overviewing the CM-2, we refer the reader to the technical summary of Thinking Machines 
[A10] for architectural details, and to Farhat, Sobh, and Park [A3] for an in-depth analysis of 
interprocessor communication on the CM-2 when computing over an irregular mesh. 

5.1. Parallel data structure 

Behind the performance of any parallel algorithm lies the choice of the corresponding parallel 
data structure. The latter is closely related to both the entity and the task to be assigned to 
each processor. Therefore, all of the computational, communication and memory requirements 
should be considered before the distributed data structure is determined. For the mixed finite 
volume/finite element method presented here, we consider four candidates for a fundamental 
entity (Fig. A4): 

• the vertex Si , 

• the edge Eij joining the vertices Si and Sj, 

• the element (here the triangle) A ijk connecting the vertices Si, Sj and Sk, 
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and the cell Ci defined in Section 3.1. 



Fig. A4. Fundamental entity candidates 


Memory considerations 

While regular grids are most often characterized (in terms of memory requirements) by their 
number of vertices Nv, irregular triangular grids can be also characterized by either their number 
of elements Na, or by their number of edges Ne ■ Here, we assume for simplicity that Tk is 
characterized by its number of vertices. Euler’s relations for a triangulation state that : 

Nv + Na — Ne = 1 
2 Ne — Nbv = 3Na 



where Nbv denotes the number of vertices at the boundary of the triangulation. This implies 
that : 


N a * 2 Nv and N E « 3 Nv (37) 

Therefore, if Tk is designed, for example, so that its number of vertices matches a given Connection 
Machine size, the VP ratio associated with each data structure candidate varies as indicated below: 

Vertex Edge Element Cell 


VPR 1 321 

The reader should note that for the edge case, the machine automatically selects a VP ratio of 4, 
since it is the closest power of two to the theoretical VPR. Clearly, the vertex and cell entities are 
the best candidates on the sole basis of efficient memory usage. 
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Operation count 

The numerical algorithms discussed in Section 2 and Section 3 can be organized around three 
basic computational steps : 

(Step a) evaluation of the Galerkin gradients (32), 

(Step b) evaluation of the diffusive fluxes (26), 

(Step c) and evaluation of the convective fluxes (27). 

While Step (c) is most efficiently performed using edge-wise computations, Step (a) and Step 
(b) are inherently element-level calculations. Therefore, whatever fundamental entity is selected, 
it must contain both edge and element information, which rules out the edge Eij data structure. 

On the other hand in an element-based partition, every triangle A ijk provides direct access 
to all of the three edges Eij , Ejk and Eki. However in that case, two VP sets must be used; 
one containing V A processors which store triangle related data (geometrical data), and another 
one containing Nv processors which store vertex related data (physical data). Otherwise, if only 
one set of virtual processors is used and assigned to both triangle and vertex data, a nodal result 
would be duplicated in as many processors as there are triangles connected to that vertex. 

The vertex entity Si is an effective candidate only when augmented with the auxiliary data 
structures that can handle the data associated with the elements and edges connected to a given 
vertex — that is, when transformed into a cell data structure. 

Finally, we note that the cell entity stores both vertex and element data, and therefore 
provides access to ail of vertex, element and edge information. Consequently, only element and 
cell partitions are retained for further discussions. 

Next, we evaluate the operation count for each of Step (a), Step (b) and Step (c), as- 
suming an element- or cell-based data structure. We denote by C ^ and C^, the number of 
arithmetic operations associated with one edge computation during Step (c), and with one tri- 
angle computation during Step (a) and Step (b), respectively. The computational complexities 
characterizing the two retained candidates are tabulated below. 

Element Cell 


Step (c) 2 x Cf 2 x Cf 

Step (a) + Step (b) C? b 3 X C£ h 

In both an element- and cell-based partition, an edge is shared by two virtual processors, so 
that the flux across [5,-5j] is computed twice. Only an edge partition would eliminate 
these redundant computations, but that choice has already been eliminated. In a cell-based 
partition, a triangle A ijk is shared by three virtual processors, and therefore additional redundant 
computations are generated. 

Communication costs 

The computational steps discussed above require four communication steps denoted here by (cl), 
(c2), (c3), and (c4). These are discussed below for the element and cell parallel data structures. 
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First, we consider the case of an element-based partition. During the first communication 
step (cl), each virtual processor assigned to a triangle A ijk gets the physical states at vertices 
Si, Sj and Sk from neighboring processors. Then, the computations in Step (a) and Step (b) 
are carried out. During the second communication step (c2), the element-wise results are sent 
back to the virtual processors holding vertex data. The latter virtual processors use these values 
to compute the nodal gradients (32) and diffusive fluxes (26). In step (c3) the nodal gradients 
are communicated to neighboring processors. Next, each virtual processor evaluates three second- 
order convective fluxes (15) across the three edges connected by triangle A,-,-*. During the last 
communication step (c4), the edge- wise fluxes are sent to the virtual processors holding vertex 
data. 


Communication with a cell-based partition is more complex, as each cell may have a differ- 
ent number of neighbors. However, fewer communication steps are needed because each virtual 
processor stores within its local memory all of the element-wise values that are necessary for the 
evaluation of the nodal gradients and the diffusive fluxes, as well as the elemental convective 
fluxes. 

The communication count associated with the four steps (cl) to (c4) is tabulated below 
for each of the two retained data structure candidates. denotes the maximum number of 

neighboring cells. 

Element Cell 


(cl) 3 

(c2) 3 

(c3) 3 

(c4) 6 

Selected candidate 

The operation and communication counts are summarized below for both the element and cell 
data structures. Equations (36) are used to express the results in terms of the number of vertices 
in the mesh. 

Element Cell 


jvmax 
11 neigh. 


o 

ma 

act 

o 


\rmax 
^ neigh 


Operation count (6 x Cf + 2 x C^ b ) x Nv (6 x Cf + 6x C£ b ) x Nv 

Communication count 30 X Nv 12 X Nv 

Clearly, redundant arithmetic operations can be avoided only at the expense of additional com- 
munication characterized by an irregular pattern, which is usually not beneficial on a massively 
parallel processor such as the CM-2. Therefore, we have chosen the cell-based parallel data struc- 
ture and have accepted the additional cost of redundant flux computations. Hammond and Barth 
[A5] have invoked a graph theory result due to Chrobak and Eppstein [A17] to eliminate redun- 
dant edge-based flux computations for Euler flows. This result states that for any planar graph, 
there exists an orientation of the edges such that no vertex has more than three edges directed 
out from it. This means that there exists a cell partition where no processor needs to compute the 
convective fluxes across more than three edges of the computational cell. However, this graph the- 
ory result does not apply for our viscous computations because these also include element-based 
operations. 
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5.2. Grid decomposition and processor mapping 

Efficiency in arbitrary communication on the CM-2 requires the minimization of both the “ham- 
mering” on the router — that is, wire contention, and the distance that information has to travel 
— that is, the number of hops between the sender and receiver processors. Here, this implies that 
: (a) adjacent cells must be assigned, as much as possible, to directly connected processors or 
processors that are lying in directly connected chips, and (b) contention for the wire connecting 
neighboring chips must be reduced. 

In a first step, the unstructured grid is decomposed into a series of subgrids each containing 
16 adjacent numerical cells. Each subgrid is assigned to a certain CM-2 chip that is subsequently 
identified, so that adjacent cells within a subgrid are assigned to directly connected processors lying 
in the same chip. As a result, off-chip communication is needed only across the subgrid boundaries. 
Wire contention is reduced if each of the defined subgrids is surrounded by the largest possible 
number of neighboring subgrids. Indeed, wherever a subgrid boundary is shared with several other 
subgrids, off-chip communication is split between distinct chips and is distributed across several 
of the available inter-chip wires (Fig. A5). On the other hand, if for example a subgrid is adjacent 
only to two other subgrids, a maximum of two wires can be used during off-chip communication, 
which may create a severe wire contention that would serialize communication and significantly 
increase its cost. Here, we use the mesh decomposer of Farhat [All] which has proven to be very 
effective at reducing wire contention on the CM-2 (Farhat, Sobh and Park [A3]). 


WIRE 1 



Fig. A5 .Grid decomposition with reduced wire- contention 

The next step is to reduce the distance that information has to travel during off-chip commu- 
nication, that is when data is exchanged between centers of cells that are assigned to processors 
lying on different chips. This can be achieved by assigning adjacent subgrids as far as possible to 
directly connected chips. A combinatorial optimization-like procedure known as Simulated An- 
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nealing (see, for example, Flower, Otto and Salama [A12]) is probably the most popular technique 
for tackling this mapping problem. However, it is a very expensive procedure which has often 
proved to be impractical. Alternative heuristic-based schemes have been developed by several au- 
thors including Bokhari [A13], Farhat [A14], and recently Hammond and Schreiber [A15]. In this 
work, we have adopted the mapper of reference [A14]. It is based on a combined greedy /divide 
and conquer approach and is tuned for hypercube topologies. 

A detailed analysis of interprocessor communication on the CM-2 for unstructured grids can 
be found in Farhat, Sobh and Park [A3]. In that reference, it is shown that mesh irregularities 
induce an MIMD (Multiple Instruction Multiple Data) style of programming for the communi- 
cation phase which dominates the cost of communication. It is also suggested that since the 
irregular pattern of communication is fixed in time, a considerable improvement can be achieved 
if that pattern is evaluated during the first time step, then compiled or stored in the CM-2 for 
re-use in subsequent time steps. However, no software was available at that time for validating 
the proposed communication strategy. Recently, a communication compiler prototype has become 
available (Dahl [A16]) and can be used for storing the routing pattern. In Section 6, we report 
on its performance. 

6. Numerical Experiments 

(This Section reports on numerical experiments on the CM-2 and Cray 2. Since airfoil problems 
are of limited important for the present research, they are not presented here.) 

7. Closure 

Mixed finite volume/finite element spatial schemes for fully unstructured grids are developed and 
implemented on the CM-2, and applied to the simulation of two-dimensional viscous flows. Second- 
order accuracy in the discretization of the convective fluxes is achieved through a Monotonic 
Upwind Scheme for Conservation Laws (MUSCL) technique. The diffusive fluxes are computed 
using a classical Galerkin finite element method, and the resulting semi-discrete equations are 
time integrated with an explicit Runge-Kutta algorithm. 

A strategy for mapping thousands of processors onto an unstructured grid is presented. 
Its key elements are given by the selection of an appropriate parallel data structure, the careful 
partitioning of a given unstructured grid into specific subgrids, and the mapping of each individual 
processor onto an entity of these subgrids. Whenever the communication patterns are compiled 
during the first time step, the total time elapsed in interprocessor communication using the router 
is drastically reduced to represent only 15% of the total CPU time of the simulation. 
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